Introduction to Data Science in Python
Lecturer: Hillary Green-Lerman
1 Course Description
Begin your journey into Data Science! Even if you’ve never written a line of code in your life, you’ll be able to follow this course and witness the power of Python to perform Data Science. You’ll use data to solve the mystery of Bayes, the kidnapped Golden Retriever, and along the way you’ll become familiar with basic Python syntax and popular Data Science modules like matplotlib (for charts and graphs) and pandas (for tabular data).
Course materials can be found here.
2 Getting Started in Python
Welcome to the wonderful world of Data Analysis in Python! In this chapter, you’ll learn the basics of Python syntax, load your first Python modules, and use functions to get a suspect list for the kidnapping of Bayes, DataCamp’s prize-winning Golden Retriever.
2.1 Lecture: Dive into Python
2.2 Importing Python Modules
Modules (sometimes called packages or libraries) help group together related sets of tools in Python. Below are sample imports of modules that are frequently used by Data Scientists:
- statsmodels: used in machine learning; usually aliased as sm;
- seaborn: a visualization library; usually aliased as sns;
- numpy: performs math operations; usually aliased as np.
Note that each module has a standard alias, which allows you to access the tools inside of the module without typing as many characters. For example, aliasing lets us shorten seaborn.scatterplot() to sns.scatterplot().
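The import statements themselves are not shown above. As a minimal sketch of the aliasing pattern, assuming numpy is installed: the two commented-out lines are assumptions about how the other modules are typically imported (statsmodels is commonly aliased through its `statsmodels.api` entry point), and the sample values are hypothetical.

```python
# Import numpy under its standard alias
import numpy as np

# The other two modules follow the same pattern (uncomment if they are installed):
# import statsmodels.api as sm
# import seaborn as sns

# With the alias, np.sqrt() refers to the same function as numpy.sqrt()
values = np.array([1, 4, 9])  # hypothetical sample values
roots = np.sqrt(values)
print(roots)
```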
Great job! You’ve learned to import three important data science modules!
2.3 Lecture: Creating Variables
2.4 Creating Numbers & Strings
Before we start looking for Bayes’ kidnapper, we need to fill out a Missing Puppy Report with details of the case. Each piece of information will be stored as a variable.
We define a variable using an equals sign (\(=\)). For instance,
## <class 'str'>
## 'DataCamp'
## height || 24 || <class 'int'>
## bayes_age || 4.0 || <class 'float'>
Notes: it’s easy to make errors when you’re trying to type strings quickly.
- Don’t forget to use quotes! Without quotes, you’ll get a name error.
- Use the same type of quotation mark. If you start with a single quote, and end with a double quote, you’ll get a syntax error.
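The code that produced the output above is not shown in these notes. A minimal sketch of what such assignments might look like: `height` and `bayes_age` come from the output above, while the variable name `owner` is a hypothetical choice for the 'DataCamp' string.

```python
# Numbers are written without quotes
height = 24        # a whole number (int)
bayes_age = 4.0    # a decimal number (float)

# Strings need the same kind of quote at both ends
owner = 'DataCamp'  # hypothetical variable name

# Check the type of each variable
print(type(owner))
print("height", "||", height, "||", type(height))
print("bayes_age", "||", bayes_age, "||", type(bayes_age))
```

Writing `owner = DataCamp` (no quotes) would raise a NameError, because Python would look for a variable called DataCamp.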
2.5 Lecture: Fun with Functions
2.6 Load a DataFrame
A ransom note was left at the scene of Bayes’ kidnapping. Eventually, we’ll want to analyze the frequency with which each letter occurs in the note, to help us identify the kidnapper. For now, we just need to load the data from ransom.csv into Python. The data can be found here.
We’ll load the data into a DataFrame, a special data type from the pandas module. It represents spreadsheet-like data (something with rows and columns).
We can create a DataFrame from a CSV (comma-separated value) file by using the function pd.read_csv().
# Import pandas
import pandas as pd
# Load the 'ransom.csv' into a DataFrame
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/ransom.csv'
ransom = pd.read_csv(url)
# Display DataFrame
ransom.head()
## letter_index letter frequency
## 0 1 A 7.38
## 1 2 B 1.09
## 2 3 C 2.46
## 3 4 D 4.10
## 4 5 E 12.84
Great job! You now have data that will eventually help you find Bayes’ kidnapper!
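To see what pd.read_csv() does without downloading anything, here is a small offline sketch. It assumes the CSV text is held in a string (the three rows are copied from the output above) and uses `io.StringIO` so that the string can be read like a file; the variable names `csv_text` and `sample` are hypothetical.

```python
import io
import pandas as pd

# CSV text: the first line holds the column names, each later line is one row
csv_text = """letter_index,letter,frequency
1,A,7.38
2,B,1.09
3,C,2.46
"""

# io.StringIO makes the string behave like a file that read_csv can parse
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.head())
```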
3 Loading Data in Pandas
In this chapter, you’ll learn a powerful Python library: pandas. It lets you read, modify, and search tabular datasets (like spreadsheets and database tables). You’ll examine credit card records for the suspects and see if any of them made suspicious purchases.
3.1 Lecture: What is pandas?
3.2 Loading a DataFrame
We’re still working hard to solve the kidnapping of Bayes, the Golden Retriever. Assume that we have narrowed the list of suspects to:
- Fred Frequentist;
- Ronald Aylmer Fisher;
- Gertrude Cox;
- Kirstine Smith.
We’ve obtained credit card records for all four suspects. Perhaps some of them made suspicious purchases before the kidnapping?
The records are in a CSV called credit_records.csv. The data can be found here.
# Load credit_records.csv
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/credit_records.csv'
credit_records = pd.read_csv(url)
# Display the first three rows of credit_records
credit_records.head(3)
## suspect location date item price
## 0 Fred Frequentist Petroleum Plaza January 1, 2018 gas 24.95
## 1 Fred Frequentist Groceries R Us January 10, 2018 cheese 5.00
## 2 Fred Frequentist Petroleum Plaza January 10, 2018 fizzy drink 1.90
What do you notice about the credit records?
3.3 Inspecting a DataFrame
We’ve loaded the credit card records of our four suspects into a DataFrame called credit_records. Let’s learn more about the structure of this DataFrame. How many rows are in credit_records?
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 104 entries, 0 to 103
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 suspect 104 non-null object
## 1 location 104 non-null object
## 2 date 104 non-null object
## 3 item 104 non-null object
## 4 price 104 non-null float64
## dtypes: float64(1), object(4)
## memory usage: 4.2+ KB
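The summary above has the format printed by the DataFrame method .info(); its line "RangeIndex: 104 entries" answers the question, since credit_records has 104 rows. As a rough sketch, the same inspection can be tried on a small stand-in DataFrame. The name `credit_sample` is hypothetical and it holds only the three rows displayed earlier, so it reports 3 rows rather than 104.

```python
import pandas as pd

# Hypothetical stand-in built from the three rows shown earlier
credit_sample = pd.DataFrame({
    "suspect": ["Fred Frequentist"] * 3,
    "location": ["Petroleum Plaza", "Groceries R Us", "Petroleum Plaza"],
    "date": ["January 1, 2018", "January 10, 2018", "January 10, 2018"],
    "item": ["gas", "cheese", "fizzy drink"],
    "price": [24.95, 5.00, 1.90],
})

# Print the column names, non-null counts, and data types
credit_sample.info()

# .shape gives (number of rows, number of columns) directly
n_rows, n_columns = credit_sample.shape
print(n_rows, n_columns)
```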
3.4 Lecture: Selecting Columns
3.5 Two Methods for Selecting Columns
Once again, we’ve loaded the credit card records of our four suspects into a DataFrame called credit_records. Let’s examine the items that they’ve purchased.
# Select the column item from credit_records
# Use brackets and string notation
credit_records["item"]
## 0 gas
## 1 cheese
## 2 fizzy drink
## 3 carwash
## 4 pants
## ...
## 99 burger
## 100 cheeseburger
## 101 cheeseburger
## 102 gas
## 103 gas
## Name: item, Length: 104, dtype: object
# Select the column item from credit_records
# Use dot notation
credit_records.item
## 0 gas
## 1 cheese
## 2 fizzy drink
## 3 carwash
## 4 pants
## ...
## 99 burger
## 100 cheeseburger
## 101 cheeseburger
## 102 gas
## 103 gas
## Name: item, Length: 104, dtype: object
Great job! Notice that both notations returned the same output.
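A small sketch that checks this claim on a stand-in DataFrame, assuming the two notations are brackets with a string and dot notation. The DataFrame `records` and its column "store name" are hypothetical; the latter is included to show that a column name containing a space can only be selected with brackets.

```python
import pandas as pd

# Hypothetical stand-in for credit_records
records = pd.DataFrame({
    "item": ["gas", "cheese", "fizzy drink"],
    "store name": ["Petroleum Plaza", "Groceries R Us", "Petroleum Plaza"],
})

# Notation 1: brackets and a string
by_brackets = records["item"]

# Notation 2: dot notation
by_dot = records.item

# Both notations return the same Series
same = by_brackets.equals(by_dot)
print(same)

# A column name with a space only works with brackets
stores = records["store name"]
print(stores)
```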
Another junior detective is examining a DataFrame of Missing Puppy Reports. The data can be found here.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/mpr.csv'
mpr = pd.read_csv(url)
# Use info() to inspect mpr
print(mpr.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 6 entries, 0 to 5
## Data columns (total 5 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Dog Name 6 non-null object
## 1 Owner Name 5 non-null object
## 2 Dog Breed 6 non-null object
## 3 Status 6 non-null object
## 4 Age 6 non-null int64
## dtypes: int64(1), object(4)
## memory usage: 368.0+ bytes
## None
# Select column "Dog Name" from mpr
name = mpr["Dog Name"]
# Select column "Status" from mpr
is_missing = mpr["Status"]
# Display the columns
print(name, is_missing)
## 0 Bayes
## 1 Sigmoid
## 2 Sparky
## 3 Theorem
## 4 Ned
## 5 Benny
## Name: Dog Name, dtype: object 0 Still Missing
## 1 Still Missing
## 2 Found
## 3 Found
## 4 Still Missing
## 5 Found
## Name: Status, dtype: object
3.6 Lecture: Selecting Rows with Logic
3.7 Logical Testing
Let’s practice writing logical statements and displaying the output.
Recall that we use the following operators:
- \(==\) tests that two values are equal;
- \(!=\) tests that two values are not equal;
- \(>\) and \(<\) test greater than and less than, respectively;
- \(>=\) and \(<=\) test greater than or equal to or less than or equal to, respectively.
The variable height_inches represents the height of a suspect. Is height_inches greater than 70 inches?
## False
The variable plate1 represents a license plate number of a suspect. Is it equal to FRQ123?
## True
The variable fur_color represents the color of Bayes’ fur. Is fur_color equal to “brown”?
## True
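The statements that produced these three results are not shown. A minimal sketch of how they might be written: the variable values are assumptions, chosen only so that the tests print False, True, and True as above, and the names `is_tall`, `plate_matches`, and `is_brown` are hypothetical.

```python
# Assumed values (not given in the notes)
height_inches = 65
plate1 = "FRQ123"
fur_color = "brown"

# Is height_inches greater than 70 inches?
is_tall = height_inches > 70
print(is_tall)

# Is plate1 equal to "FRQ123"?
plate_matches = plate1 == "FRQ123"
print(plate_matches)

# Is fur_color equal to "brown"?
is_brown = fur_color == "brown"
print(is_brown)
```

Note the difference between a single equals sign, which assigns a value, and a double equals sign, which tests for equality.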
Great job! Let’s use these logical statements to select some rows!
3.8 Selecting Missing Puppies
Let’s return to our DataFrame of missing puppies, which is loaded as mpr. Let’s select a few different rows to learn more about the other missing dogs.
## Dog Name Owner Name Dog Breed Status Age
## 2 Sparky Dr. Apache Border Collie Found 3
## 3 Theorem Joseph-Louis Lagrange French Bulldog Found 4
## 5 Benny Hillary Green-Lerman Poodle Found 3
## Dog Name Owner Name Dog Breed Status Age
## 0 Bayes DataCamp Golden Retriever Still Missing 1
## 1 Sigmoid NaN Dachshund Still Missing 2
## 4 Ned Tim Oliphant Shih Tzu Still Missing 2
## Dog Name Owner Name Dog Breed Status Age
## 0 Bayes DataCamp Golden Retriever Still Missing 1
## 1 Sigmoid NaN Dachshund Still Missing 2
## 2 Sparky Dr. Apache Border Collie Found 3
## 3 Theorem Joseph-Louis Lagrange French Bulldog Found 4
## 4 Ned Tim Oliphant Shih Tzu Still Missing 2
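The selection code for the three tables above is missing from these notes. The sketch below rebuilds mpr from the rows displayed and applies three conditions that would give the same tables; the conditions are inferred from the output, so treat them as assumptions. The names `found`, `still_missing`, and `not_poodle` are hypothetical.

```python
import pandas as pd

# Rebuild the DataFrame from the rows shown above
mpr = pd.DataFrame({
    "Dog Name": ["Bayes", "Sigmoid", "Sparky", "Theorem", "Ned", "Benny"],
    "Owner Name": ["DataCamp", None, "Dr. Apache", "Joseph-Louis Lagrange",
                   "Tim Oliphant", "Hillary Green-Lerman"],
    "Dog Breed": ["Golden Retriever", "Dachshund", "Border Collie",
                  "French Bulldog", "Shih Tzu", "Poodle"],
    "Status": ["Still Missing", "Still Missing", "Found", "Found",
               "Still Missing", "Found"],
    "Age": [1, 2, 3, 4, 2, 3],
})

# Rows where the dog has been found
found = mpr[mpr["Status"] == "Found"]
print(found)

# Rows where the dog is still missing
still_missing = mpr[mpr.Status == "Still Missing"]
print(still_missing)

# Rows where the breed is not Poodle
not_poodle = mpr[mpr["Dog Breed"] != "Poodle"]
print(not_poodle)
```

The logical test inside the brackets produces one True or False per row, and only the rows marked True are kept.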
Great job! Now that you’re familiar with selecting rows, let’s examine the credit report data!
3.9 Narrowing the List of Suspects
Recall the list of suspects that might have kidnapped Bayes:
- Fred Frequentist;
- Ronald Aylmer Fisher;
- Gertrude Cox;
- Kirstine Smith.
We’d like to narrow this list down, so we obtained credit card records for each suspect. We’d like to know if any of them recently purchased dog treats to use in the kidnapping. If they did, they would have visited ‘Pet Paradise’.
The credit records have been loaded into a DataFrame called credit_records.
## suspect location date item price
## 8 Fred Frequentist Pet Paradise January 14, 2018 dog treats 8.75
## 9 Fred Frequentist Pet Paradise January 14, 2018 dog collar 12.25
## 28 Gertrude Cox Pet Paradise January 13, 2018 dog chew toy 5.95
## 29 Gertrude Cox Pet Paradise January 13, 2018 dog treats 8.75
Both Fred Frequentist and Gertrude Cox purchased dog treats. Perhaps they were trying to lure Bayes into their trap?
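The filter that produced the table above is not shown. A sketch under stated assumptions: `credit_sample` is a hypothetical four-row stand-in for credit_records, and the condition on the location column is inferred from the text. The second step, listing who bought dog treats, is an illustrative extra.

```python
import pandas as pd

# Hypothetical stand-in with a few of the rows shown in these notes
credit_sample = pd.DataFrame({
    "suspect": ["Fred Frequentist", "Fred Frequentist",
                "Fred Frequentist", "Gertrude Cox"],
    "location": ["Petroleum Plaza", "Pet Paradise",
                 "Pet Paradise", "Pet Paradise"],
    "item": ["gas", "dog treats", "dog collar", "dog treats"],
    "price": [24.95, 8.75, 12.25, 8.75],
})

# Keep only the purchases made at Pet Paradise
pet_purchases = credit_sample[credit_sample.location == "Pet Paradise"]
print(pet_purchases)

# Among those purchases, list the suspects who bought dog treats
treat_rows = pet_purchases[pet_purchases["item"] == "dog treats"]
treat_buyers = list(treat_rows["suspect"].unique())
print(treat_buyers)
```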
4 Plotting Data with Matplotlib
Get ready to visualize your data! You’ll create line plots with another Python module: Matplotlib. Using line plots, you’ll analyze the letter frequencies from the ransom note and several handwriting samples to determine the kidnapper.
4.1 Lecture: Creating Line Plots
4.2 Working Hard
Several police officers have been working hard to help us solve the mystery of Bayes, the kidnapped Golden Retriever. Their commanding officer wants to know exactly how hard each officer has been working on this case. Officer Deshaun has created a DataFrame called deshaun to track the amount of time he spent working on this case. The DataFrame contains two columns:
- day_of_week: a string representing the day of the week;
- hours_worked: the number of hours that a particular officer worked on the Bayes case.
The data can be found here.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/deshaun.csv'
deshaun = pd.read_csv(url)
# From matplotlib, import pyplot under the alias plt
from matplotlib import pyplot as plt
# Plot Officer Deshaun's hours_worked vs. day_of_week
plt.plot(deshaun.day_of_week, deshaun.hours_worked)
# Display Deshaun's plot
plt.show()
Great job! It seems like Deshaun works a lot on Monday and Friday, but not so much on Wednesday. In the next exercise, you’ll compare Deshaun’s work to his coworkers’ hours.
4.3 Or Hardly Working?
Two other officers have been working with Deshaun to help find Bayes. Their names are Officer Mengfei and Officer Aditya. Deshaun used their time cards to create two more DataFrames: mengfei and aditya. Let’s plot all three lines together to see who was working hard each day.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/aditya.csv'
aditya = pd.read_csv(url)
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/mengfei.csv'
mengfei = pd.read_csv(url)
# Plot Officer Deshaun's hours_worked vs. day_of_week
plt.plot(deshaun.day_of_week, deshaun.hours_worked)
# Plot Officer Aditya's hours_worked vs. day_of_week
plt.plot(aditya.day_of_week, aditya.hours_worked)
# Plot Officer Mengfei's hours_worked vs. day_of_week
plt.plot(mengfei.day_of_week, mengfei.hours_worked)
# Display all three line plots
plt.show()
The orange line has no hours worked on Thursday or Friday. But whom does the orange line represent? Let’s learn how to add a legend, and then continue working on the mystery of who kidnapped Bayes.
4.4 Lecture: Adding Text to Plots
4.5 Adding a Legend
Officers Deshaun, Mengfei, and Aditya have all been working with you to solve the kidnapping of Bayes. Their supervisor wants to know how much time each officer has spent working on the case.
Deshaun created a plot of data from the DataFrames deshaun, mengfei, and aditya previously. Now he wants to add a legend to distinguish the three lines.
# Officer Deshaun
plt.plot(deshaun.day_of_week, deshaun.hours_worked, label = 'Deshaun')
# Add a label to Aditya's plot
plt.plot(aditya.day_of_week, aditya.hours_worked, label = 'Aditya')
# Add a label to Mengfei's plot
plt.plot(mengfei.day_of_week, mengfei.hours_worked, label = 'Mengfei')
# Add a command to make the legend display
plt.legend()
# Display plot
plt.show()
Great job! Mengfei’s line shows no hours worked on Monday and Tuesday. Let’s add some labels to this graph so that we can share it with Deshaun’s supervisor.
4.6 Adding Labels
If we give a chart with no labels to Officer Deshaun’s supervisor, she won’t know what the lines represent.
We need to add labels to Officer Deshaun’s plot of hours worked.
# Lines
plt.plot(deshaun.day_of_week, deshaun.hours_worked, label = 'Deshaun')
plt.plot(aditya.day_of_week, aditya.hours_worked, label = 'Aditya')
plt.plot(mengfei.day_of_week, mengfei.hours_worked, label = 'Mengfei')
# Add a title
plt.title('Hours Worked per Days of Week')
# Add y-axis label
plt.ylabel('Hours Worked')
# Legend
plt.legend()
# Display plot
plt.show()
4.7 Adding Floating Text
Officer Deshaun is examining the number of hours that he worked over the past six months. The data can be found here.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/six_months.csv'
six_months = pd.read_csv(url)
six_months
## month hours_worked
## 0 Jan 160
## 1 Feb 185
## 2 Mar 182
## 3 Apr 195
## 4 Jun 50
The number for June is low because he only had data for the first week. Let’s help Deshaun by adding an annotation to the graph to explain this.
# Create plot
plt.plot(six_months.month, six_months.hours_worked)
# Add annotation "Missing June data" at (2.5, 80)
plt.text(2.5, 80, "Missing June data")
# Display graph
plt.show()
Great job! The graph would have been confusing without that extra information.
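Why does an x-coordinate of 2.5 work when the x-axis shows month names? When strings are plotted, Matplotlib places the categories at positions 0, 1, 2, and so on, so 2.5 falls between "Mar" (position 2) and "Apr" (position 3). The sketch below repeats the plot with the data typed in by hand. Two assumptions: it selects the non-interactive "Agg" backend so that it runs without a display, and it saves the figure to an in-memory buffer instead of calling plt.show().

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
from matplotlib import pyplot as plt

# Data copied from the six_months table above
months = ["Jan", "Feb", "Mar", "Apr", "Jun"]
hours_worked = [160, 185, 182, 195, 50]

# String categories are drawn at x positions 0, 1, 2, 3, 4
plt.plot(months, hours_worked)

# x = 2.5 lies between "Mar" (2) and "Apr" (3); y = 80 is in hours
annotation = plt.text(2.5, 80, "Missing June data")

# Render the figure to memory instead of opening a window
buffer = io.BytesIO()
plt.savefig(buffer, format="png")
```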
4.8 Lecture: Styling Graphs
4.9 Tracking Crime Statistics
Sergeant Laura wants to do some background research to help her better understand the cultural context for Bayes’ kidnapping. She has plotted Burglary rates in three U.S. cities using data from the Uniform Crime Reporting Statistics. The data can be found here.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/data.csv'
data = pd.read_csv(url)
Remember:
- You can change linestyle to dotted (':'), dashed ('--'), or no line ('');
- You can change the marker to circle ('o'), diamond ('d'), or square ('s').
plt.plot(data["Year"], data["Phoenix Police Dept"],
label = "Phoenix", color = "DarkCyan")
plt.plot(data["Year"], data["Los Angeles Police Dept"],
label = "Los Angeles", linestyle = ':')
plt.plot(data["Year"], data["Philadelphia Police Dept"],
label = "Philadelphia", marker = 's')
plt.legend()
plt.show()
Great job! This was a lot of work. Perhaps we can make this easier by setting a global style.
4.10 Playing with Styles
Changing the plotting style is a fast way to change the entire look of your plot without having to update individual colors or line styles. Some popular styles include:
- ‘fivethirtyeight’ - Based on the color scheme of the popular website;
- ‘grayscale’ - Great for when you don’t have a color printer!
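- ‘seaborn-v0_8’ - Based on another Python visualization library (this style was named ‘seaborn’ in older Matplotlib releases);
- ‘classic’ - The classic Matplotlib look, which was the default before Matplotlib 2.0.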
# Change the style to fivethirtyeight
plt.style.use('fivethirtyeight')
# Plot lines
plt.plot(data["Year"], data["Phoenix Police Dept"], label = "Phoenix")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label = "Los Angeles")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label = "Philadelphia")
# Add a legend
plt.legend()
# Display the plot
plt.show()
# Change the style to ggplot
plt.style.use('ggplot')
# Plot lines
plt.plot(data["Year"], data["Phoenix Police Dept"], label = "Phoenix")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label = "Los Angeles")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label = "Philadelphia")
# Add a legend
plt.legend()
# Display the plot
plt.show()
# List the styles available in this Matplotlib installation
print(plt.style.available)
## ['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-v0_8', 'seaborn-v0_8-bright', 'seaborn-v0_8-colorblind', 'seaborn-v0_8-dark', 'seaborn-v0_8-dark-palette', 'seaborn-v0_8-darkgrid', 'seaborn-v0_8-deep', 'seaborn-v0_8-muted', 'seaborn-v0_8-notebook', 'seaborn-v0_8-paper', 'seaborn-v0_8-pastel', 'seaborn-v0_8-poster', 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-white', 'seaborn-v0_8-whitegrid', 'tableau-colorblind10']
Great job! With this background information, you’re ready to finally find the kidnapper.
4.11 Identifying Bayes’ Kidnapper
We’ve narrowed the possible kidnappers down to two suspects:
- Fred Frequentist;
- Gertrude Cox.
The kidnapper left a long ransom note containing several unusual phrases. Let’s use a line plot to compare the frequency of letters in the ransom note to samples from the two main suspects.
Two more DataFrames have been loaded, besides ransom:
- suspect1 contains the letter frequencies for the sample from Fred Frequentist;
- suspect2 contains the letter frequencies for the sample from Gertrude Cox.
Each DataFrame contains the columns letter and frequency.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/suspect1.csv'
suspect1 = pd.read_csv(url)
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/suspect2.csv'
suspect2 = pd.read_csv(url)
# Plot each line
plt.plot(ransom.letter, ransom.frequency,
label = 'Ransom', linestyle = ':', color = 'gray')
plt.plot(suspect1.letter, suspect1.frequency, label = 'Fred Frequentist')
plt.plot(suspect2.letter, suspect2.frequency, label = 'Gertrude Cox')
# Add x- and y-labels
plt.xlabel("Letter")
plt.ylabel("Frequency")
# Add a legend
plt.legend()
# Display plot
plt.show()
It looks like Fred Frequentist is the kidnapper. Both the ransom note and Fred’s sample have a low frequency of H and a high frequency of P.
5 Different Types of Plots
In this final chapter, you’ll learn how to create three new plot types: scatter plots, bar plots, and histograms. You’ll use these tools to locate where the kidnapper is hiding and rescue Bayes, the Golden Retriever.
5.1 Lecture: Making a Scatter Plot
5.2 Charting Cellphone Data
We know that Fred Frequentist is the one who kidnapped Bayes the Golden Retriever. Now we need to learn where he is hiding.
Our friends at the police station have acquired cell phone data, which gives some of Fred’s locations over the past three weeks. It’s stored in the DataFrame cellphone. The x-coordinates are in the column x and the y-coordinates are in the column y.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/cellphone.csv'
cellphone = pd.read_csv(url)
# Explore the data
cellphone.head()
## x y
## 0 28.136519 39.358650
## 1 44.642131 58.214270
## 2 34.921629 42.039109
## 3 31.034296 38.283153
## 4 36.419871 65.971441
# Create a scatterplot
plt.scatter(cellphone.x, cellphone.y)
# Add labels
plt.ylabel('Latitude')
plt.xlabel('Longitude')
# Display the plot
plt.show()
Great job! Next, we’ll use keyword arguments to make this plot a little bit prettier.
5.3 Modifying a Scatterplot
Previously, we created a scatter plot to show Fred Frequentist’s cell phone data.
Now, we will do some magic so that the plot will appear over a map of our town. If we just plot the data as we did before, we won’t be able to see the map or pick out the areas with the most points. We can fix this by changing the colors, markers, and transparency of the scatter plot.
import numpy as np
import urllib.request
from PIL import Image
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/town_map.png'
# Download the map image and convert it to a NumPy array
town_map = np.array(Image.open(urllib.request.urlopen(url)))
plt.scatter(cellphone.x, cellphone.y, color = 'red',
marker = 's', alpha = 0.1)
plt.imshow(town_map,
extent = [min(cellphone.x), max(cellphone.x),
min(cellphone.y), max(cellphone.y)])
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
Great job! Fred has been spending a lot of time in Blue Meadows Park, Happy Mountain Trailhead, and Shady Groves Campsite.
5.4 Lecture: Making a Bar Chart
5.5 Build a Simple Bar Chart
Officer Deshaun wants to plot the average number of hours worked per week for him and his coworkers. He has stored the hours worked in a DataFrame called hours.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/hours.csv'
hours = pd.read_csv(url)
hours
## officer avg_hours_worked std_hours_worked
## 0 Deshaun 45 3
## 1 Mengfei 33 9
## 2 Aditya 42 5
# Create a bar plot from the DataFrame hours
plt.bar(hours.officer, hours.avg_hours_worked,
# Add error bars
yerr = hours.std_hours_worked)
# Display the plot
plt.show()
Excellent! Let’s keep investigating and see how each officer was spending his or her time.
5.6 Where did the Time Go?
Officer Deshaun wants to compare the hours spent on field work and desk work between him and his colleagues. In this DataFrame, he has split out the average hours worked per week into desk_work and field_work.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/hours2.csv'
hours = pd.read_csv(url)
# Plot the number of hours spent on desk work
plt.bar(hours.officer, hours.desk_work, label = 'Desk Work')
# Plot the hours spent on field work on top of desk work
plt.bar(hours.officer, hours.field_work,
bottom = hours.desk_work,label = "Field Work")
# Add a legend
plt.legend()
# Display the plot
plt.show()
Wonderful! It looks like Officer Aditya spent the most time on field work.
5.7 Lecture: Making a Histogram
5.8 Modifying Histograms
Let’s explore how changes to keyword parameters in a histogram can change the output. Recall that:
- range sets the minimum and maximum values that will be included in our histogram;
- bins sets the number of bins (bars) that the data are grouped into.
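To see the effect of these two parameters as numbers rather than as a picture, here is a small sketch using numpy's histogram function, which plt.hist() relies on to do the counting. The weights are hypothetical values, not taken from puppies.csv.

```python
import numpy as np

# Hypothetical puppy weights in lbs
weights = np.array([3, 6, 8, 12, 15, 18, 22, 27, 33, 40])

# Group the weights between 5 and 35 into 3 bins of equal width
counts, edges = np.histogram(weights, bins=3, range=(5, 35))

# One count per bin
print(counts)

# The bin boundaries: always one more edge than there are bins
print(edges)
```

The weights 3 and 40 fall outside the range, so they are not counted in any bin; the three counts add up to 8 rather than 10.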
We’ll be exploring the weights of various puppies from the DataFrame puppies.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/puppies.csv'
puppies = pd.read_csv(url)
# Create a histogram of the column weight from the DataFrame puppies
plt.hist(puppies.weight)
# Add labels
plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')
# Display
plt.show()
# Change the number of bins to 50
plt.hist(puppies.weight, bins = 50)
# Add labels
plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')
# Display
plt.show()
# Change the range to start at 5 and end at 35
plt.hist(puppies.weight, range = (5, 35))
# Add labels
plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')
# Display
plt.show()
Great job! Increasing the number of bins made your plot spikier. Changing the range restricted the portion of the dataset that was plotted.
5.9 Heroes with Histograms
We’ve identified that the kidnapper is Fred Frequentist. Now we need to know where Fred is hiding Bayes.
A shoe print at the crime scene contains a specific type of gravel. Based on the distribution of gravel radii, we can determine where the kidnapper recently visited.
url = 'https://raw.githubusercontent.com/QuanNguyenIU/QuanNguyenIU.github.io/main/DataCamp/Python/Intro.%20to%20Data%20Science%20in%20Python/gravel.csv'
gravel = pd.read_csv(url)
# Create a histogram
plt.hist(gravel.radius, bins = 40, range = (2, 8), density = True)
# Label plot
plt.xlabel('Gravel Radius (mm)')
plt.ylabel('Frequency')
plt.title('Sample from Shoeprint')
# Display histogram
plt.show()
Based on these images, where is Fred Frequentist hiding?
6 Course Recap
Congratulations on completing the course! More courses, tracks and instructions can be found here. Happy learning!